End-to-End Machine Learning with H2O DSC5.0 Tutorial, Part 1 (bit.ly/dsc50_h2o_tutorial)

1 Agenda

  • 16:15 to 17:00 Set Up & Introduction
  • 17:00 to 17:45 Regression Example
  • 17:45 to 18:00 Coffee Break
  • 18:00 to 18:45 Classification Example
  • 18:45 to 19:15 Bring Your Own Data + Q&A

2 Set Up

2.1 Download -> bit.ly/dsc50_h2o_tutorial

  • scripts/setup.R: installs the required packages
  • rmd/tutorial_pt1.Rmd: RMarkdown file with introduction and regression code
  • rmd/tutorial_pt2.Rmd: RMarkdown file with classification code
  • scripts/tutorial_pt1.R: R Script file with introduction and regression code
  • scripts/tutorial_pt2.R: R Script file with classification code
  • tutorial.html: this webpage
  • Full URL https://github.com/woobe/useR2019_h2o_tutorial (if bit.ly doesn’t work)

2.2 R Packages

  • Check out setup.R
  • For this tutorial:
    • h2o for machine learning
    • mlbench for Boston Housing dataset
    • DALEX, iBreakDown, ingredients & pdp for explaining model predictions
  • For RMarkdown
    • knitr for rendering this RMarkdown
    • rmdformats for readthedown RMarkdown template
    • DT for nice tables
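
In case scripts/setup.R is unavailable, the installation boils down to something like the sketch below (the exact package list in setup.R may differ):

```r
pkgs <- c("h2o", "mlbench", "DALEX", "iBreakDown", "ingredients", "pdp",
          "knitr", "rmdformats", "DT")
# Install only the packages that are not present yet
install.packages(setdiff(pkgs, rownames(installed.packages())))
```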

3 Introduction

This is a hands-on tutorial for R beginners. It demonstrates how to use H2O and other R packages for end-to-end, automatic and interpretable machine learning. Participants will build regression and classification models quickly with the H2O library, and will learn to explain the model outcomes with various methods.

It is a workshop for R beginners and anyone interested in machine learning. The RMarkdown source and the rendered HTML will be provided, so everyone can follow along without running the code.

4 Regression Part One: H2O AutoML

4.1 Data - Boston Housing from mlbench

Source: UCI Machine Learning Repository

  • crim: per capita crime rate by town.
  • zn: proportion of residential land zoned for lots over 25,000 sq.ft.
  • indus: proportion of non-retail business acres per town.
  • chas: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
  • nox: nitrogen oxides concentration (parts per 10 million).
  • rm: average number of rooms per dwelling.
  • age: proportion of owner-occupied units built prior to 1940.
  • dis: weighted mean of distances to five Boston employment centres.
  • rad: index of accessibility to radial highways.
  • tax: full-value property-tax rate per $10,000.
  • ptratio: pupil-teacher ratio by town.
  • b: 1000(Bk - 0.63)^2 where Bk is the proportion of people of African American descent by town.
  • lstat: lower status of the population (percent).
  • medv (This is the TARGET): median value of owner-occupied homes in $1000s.
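
The dataset ships with the mlbench package and can be loaded like so:

```r
library(mlbench)
data("BostonHousing")        # 506 rows, 14 columns (13 features + medv)
summary(BostonHousing$medv)  # target: median home value in $1000s
```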

4.2 Define Target and Features

 [1] "crim"    "zn"      "indus"   "chas"    "nox"     "rm"      "age"    
 [8] "dis"     "rad"     "tax"     "ptratio" "b"       "lstat"  
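
The feature names printed above can be derived from the data with a sketch like this (the variable names `target` and `features` are my own; `BostonHousing` is assumed to be loaded as in the previous step):

```r
target   <- "medv"
features <- setdiff(colnames(BostonHousing), target)
print(features)  # the 13 predictor names listed above
```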

4.3 Start a local H2O Cluster (JVM)

 Connection successful!

R is connected to the H2O cluster: 
    H2O cluster uptime:         1 hours 26 minutes 
    H2O cluster timezone:       Europe/Belgrade 
    H2O data parsing timezone:  UTC 
    H2O cluster version:        3.26.0.2 
    H2O cluster version age:    3 months and 20 days !!! 
    H2O cluster name:           H2O_started_from_R_branko_gkt443 
    H2O cluster total nodes:    1 
    H2O cluster total memory:   3.73 GB 
    H2O cluster total cores:    8 
    H2O cluster allowed cores:  8 
    H2O cluster healthy:        TRUE 
    H2O Connection ip:          localhost 
    H2O Connection port:        54321 
    H2O Connection proxy:       NA 
    H2O Internal Security:      FALSE 
    H2O API Extensions:         Amazon S3, XGBoost, Algos, AutoML, Core V3, Core V4 
    R Version:                  R version 3.5.2 (2018-12-20) 
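
The cluster information above is printed by h2o.init(); a typical start-up call looks like this (the memory and core settings are illustrative, not the exact ones used here):

```r
library(h2o)
h2o.init(nthreads = -1,        # use all available CPU cores
         max_mem_size = "4G")  # illustrative JVM heap size
```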

4.4 Convert R dataframe into H2O dataframe
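
A sketch of the conversion, plus the train/test split used in later sections (the frame names h_boston, h_train and h_test are assumptions carried through the rest of this tutorial; the split ratio is illustrative):

```r
h_boston <- as.h2o(BostonHousing)  # copy the R data frame into the H2O cluster
h_split  <- h2o.splitFrame(h_boston, ratios = 0.8, seed = 1234)
h_train  <- h_split[[1]]           # ~80% of rows for training
h_test   <- h_split[[2]]           # remainder held out for testing
```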

4.6 Cross-Validation


4.7 Baseline Models

  • h2o.glm(): H2O Generalized Linear Model
  • h2o.randomForest(): H2O Random Forest Model
  • h2o.gbm(): H2O Gradient Boosting Model
  • h2o.deeplearning(): H2O Deep Neural Network Model
  • h2o.xgboost(): H2O wrapper for eXtreme Gradient Boosting Model from DMLC
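
All of these functions share a common interface; a baseline GLM with 5-fold cross-validation can be sketched as follows (frame and column names follow the earlier assumptions; every hyper-parameter is left at its default):

```r
model_glm <- h2o.glm(x = features, y = target,
                     training_frame = h_train,
                     nfolds = 5,   # 5-fold cross-validation, as in section 4.6
                     seed = 1234)
model_glm  # printing the model shows its cross-validated metrics
```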

4.7.1 Baseline Generalized Linear Model (GLM)

H2ORegressionMetrics: glm
** Reported on cross-validation data. **
** 5-fold cross-validation on training data (Metrics computed for combined holdout predictions) **

MSE:  23.04256
RMSE:  4.800267
MAE:  3.307191
RMSLE:  NaN
Mean Residual Deviance :  23.04256
R^2 :  0.7076243
Null Deviance :32617.3
Null D.o.F. :410
Residual Deviance :9470.494
Residual D.o.F. :396
AIC :2487.815
H2ORegressionMetrics: glm

MSE:  28.87315
RMSE:  5.373374
MAE:  3.859189
RMSLE:  0.1861469
Mean Residual Deviance :  28.87315
R^2 :  0.7254239
Null Deviance :10402.21
Null D.o.F. :94
Residual Deviance :2742.949
Residual D.o.F. :80
AIC :621.075

Let’s use RMSE as our metric for comparing models.

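For reference, RMSE is the square root of the mean of the squared residuals; in base R:

```r
# Root Mean Squared Error
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}
rmse(c(3, 5, 7), c(2, 6, 8))  # every residual is +/-1, so RMSE = 1
```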

4.7.3 Comparison (RMSE: Lower = Better)

4.8 Manual Tuning

4.8.1 Check out the hyper-parameters for each algo
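
As an example, a manually tuned GBM might look like the sketch below (all values are illustrative assumptions, not the tuning actually used in this tutorial; see ?h2o.gbm for the full list of hyper-parameters):

```r
model_gbm_tuned <- h2o.gbm(x = features, y = target,
                           training_frame = h_train,
                           ntrees = 500,           # more, slower-learning trees
                           max_depth = 5,
                           learn_rate = 0.05,
                           sample_rate = 0.8,      # row sampling per tree
                           col_sample_rate = 0.8,  # column sampling per split
                           nfolds = 5, seed = 1234)
```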

4.8.3 Comparison (RMSE: Lower = Better)

4.9 H2O AutoML
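
An AutoML run can be sketched as follows (the time budget and names are assumptions; AutoML trains and cross-validates a set of models, then builds stacked ensembles on top):

```r
automl <- h2o.automl(x = features, y = target,
                     training_frame = h_train,
                     nfolds = 5,
                     max_runtime_secs = 120,  # illustrative time budget
                     seed = 1234)
automl@leaderboard  # all trained models, ranked by cross-validated metric
automl@leader       # the best model (a stacked ensemble below)
```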

4.9.1 Leaderboard

4.9.2 Best Model (Leader)

Model Details:
==============

H2ORegressionModel: stackedensemble
Model ID:  StackedEnsemble_BestOfFamily_AutoML_20191116_120246 
NULL


H2ORegressionMetrics: stackedensemble
** Reported on training data. **

MSE:  0.4180159
RMSE:  0.6465415
MAE:  0.5062811
RMSLE:  0.03140634
Mean Residual Deviance :  0.4180159



H2ORegressionMetrics: stackedensemble
** Reported on cross-validation data. **
** 5-fold cross-validation on training data (Metrics computed for combined holdout predictions) **

MSE:  7.834856
RMSE:  2.799081
MAE:  1.874201
RMSLE:  0.1211643
Mean Residual Deviance :  7.834856

4.9.3 Comparison (RMSE: Lower = Better)

4.10 Make Predictions

   predict
1 36.43380
2 17.12354
3 21.21855
4 18.26257
5 17.64718
6 17.80730
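
The predictions above come from a call along these lines (automl and h_test follow the earlier assumptions):

```r
yhat_test <- h2o.predict(automl@leader, newdata = h_test)
head(yhat_test)  # one predicted medv value per test-set row
```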

5 Regression Part Two: XAI

Let’s look at the first house in h_test.

5.2 Package DALEX


5.2.1 The explain() Function

The first step in using the DALEX package is to wrap the black-box model with meta-data that unifies model interfacing.

To create an explainer, we use the explain() function. The validation dataset for the models is h_test from part one. For models created by the h2o package, we have to provide a custom predict function that takes two arguments, model and newdata, and returns a numeric vector of predictions.
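
A sketch of one such explainer (the helper name custom_predict and the AutoML object follow the earlier assumptions):

```r
library(DALEX)

# Bridge between DALEX (expects a numeric vector) and H2O (returns an H2OFrame)
custom_predict <- function(model, newdata) {
  as.vector(h2o.predict(model, as.h2o(newdata))$predict)
}

explainer_automl <- explain(model = automl@leader,
                            data  = as.data.frame(h_test)[, features],
                            y     = as.data.frame(h_test)$medv,
                            predict_function = custom_predict,
                            label = "H2O AutoML")
```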

5.2.2 Explainer for H2O Models

Preparation of a new explainer is initiated
  -> model label       :  Random Forest 
  -> data              :  95  rows  13  cols 
  -> target variable   :  95  values 
  -> predict function  :  custom_predict 
  -> predicted values  :  numerical, min =  9.674672 , mean =  23.89372 , max =  46.382  
  -> residual function :  difference between y and yhat (  default  )
  -> residuals         :  numerical, min =  -11.4882 , mean =  0.3315435 , max =  9.9948  
  -> model_info        :  package Model of class: H2ORegressionModel package unrecognized , ver. Unknown , task regression (  default  ) 
  A new explainer has been created!  
Preparation of a new explainer is initiated
  -> model label       :  Deep Neural Networks 
  -> data              :  95  rows  13  cols 
  -> target variable   :  95  values 
  -> predict function  :  custom_predict 
  -> predicted values  :  numerical, min =  11.47347 , mean =  25.9381 , max =  52.32421  
  -> residual function :  difference between y and yhat (  default  )
  -> residuals         :  numerical, min =  -25.15236 , mean =  -1.712836 , max =  8.32301  
  -> model_info        :  package Model of class: H2ORegressionModel package unrecognized , ver. Unknown , task regression (  default  ) 
  A new explainer has been created!  
Preparation of a new explainer is initiated
  -> model label       :  XGBoost 
  -> data              :  95  rows  13  cols 
  -> target variable   :  95  values 
  -> predict function  :  custom_predict 
  -> predicted values  :  numerical, min =  8.771681 , mean =  24.31138 , max =  50.43563  
  -> residual function :  difference between y and yhat (  default  )
  -> residuals         :  numerical, min =  -20.09026 , mean =  -0.08612126 , max =  9.010389  
  -> model_info        :  package Model of class: H2ORegressionModel package unrecognized , ver. Unknown , task regression (  default  ) 
  A new explainer has been created!  
Preparation of a new explainer is initiated
  -> model label       :  H2O AutoML 
  -> data              :  95  rows  13  cols 
  -> target variable   :  95  values 
  -> predict function  :  custom_predict 
  -> predicted values  :  numerical, min =  8.667347 , mean =  24.35622 , max =  49.94388  
  -> residual function :  difference between y and yhat (  default  )
  -> residuals         :  numerical, min =  -13.3916 , mean =  -0.130959 , max =  9.976462  
  -> model_info        :  package Model of class: H2ORegressionModel package unrecognized , ver. Unknown , task regression (  default  ) 
  A new explainer has been created!  

5.2.3 Variable importance

Using the DALEX package, we can better understand which variables are important.

Model-agnostic variable importance is calculated by means of permutations: we subtract the loss calculated on the intact validation dataset from the loss calculated on the validation dataset with the values of a single variable permuted. The larger the increase in loss, the more important that variable is to the model.

This method is implemented in the ingredients::feature_importance() function.
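
A sketch of the call, applied to the explainer built above (names are the earlier assumptions; the loss function choice mirrors the RMSE metric used in part one):

```r
library(ingredients)
fi <- feature_importance(explainer_automl,
                         loss_function = DALEX::loss_root_mean_square)
plot(fi)  # drop in performance after permuting each variable
```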

5.2.4 Partial Dependence Plots

Partial Dependence Plots (PDP) are one of the most popular methods for exploring the relationship between a continuous variable and the model outcome. The function variable_response() with the parameter type = "pdp" calls the pdp::partial() function to calculate the PDP response.

Let’s look at the feature rm (average number of rooms per dwelling).
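
A sketch of the call, using the explainer built earlier (this follows the variable_response() interface of the DALEX version current at the time; newer DALEX releases moved this functionality to ingredients::partial_dependence()):

```r
pdp_rm <- variable_response(explainer_automl, variable = "rm", type = "pdp")
plot(pdp_rm)  # predicted medv as a function of rm, other features averaged out
```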

5.2.5 Prediction Understanding

The function break_down() is a wrapper around the iBreakDown package. A model prediction is visualized with Break Down plots, which show the contribution of every variable present in the model. break_down() generates variable attributions for the selected prediction, and the generic plot() function displays them.
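
Applied to the first house in h_test (names follow the earlier assumptions), the call can be sketched as:

```r
library(iBreakDown)
bd <- break_down(explainer_automl,
                 new_observation = as.data.frame(h_test)[1, features])
plot(bd)  # waterfall plot of per-variable contributions to the prediction
```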

6 Coffee Break 17:45 - 18:00


2019-11-16